• 1. Introduction
  • 2. Data Collection
  • 3. Exploratory Analysis
  • 4. Feature Engineering
  • 5. Regression and Validation
  • 6. Application
  • 7. Conclusion

1. Introduction

Trains are considered as one of the most significant transportations in Belgium. Passengers complaint about the experiences on trains as it is hard to find seats and avoid the crowdedness. It will cause dramatic waste in time as well as resources. For instance, passenger may miss the train if it is too crowded. According to the report, National Railway Company of Belgium has been continuing to loss profit during the recent years. Belgium government as well as National Railway Company of Belgium are intended to develop an app with specific algorithm to predict the occupancy of each train. As developers, we are selected as one of teams to help to design this functional app.

The app will be served as a daily train information hub for passenger to access occupancy, times, price, origin, destination, prediction accuracy and other historical data. It will help the government and planners to better understand the train system. The company could allocate the vehicle types, resources and adjust price based on the occupancy to ensure the maximum profit. On the other hand, passenger can have more options and flexible schedules for taking the trains.

Below is the algorithm we developed to better predict the train occupancy. The following steps have been organized:data collection, date wrangling, exploratory analysis, spatial process, feature engineering as well as modeling and validation.

Below are some packages to be loaded as well as functions such as knn.

mapTheme <- function(base_size = 10) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 12,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    strip.background = element_rect(fill = "white", color = "white"),
    strip.text.x = element_text(size = 8),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank()
  )
}

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  axis.ticks=element_blank())
q5 <- function(variable) {as.factor(ntile(variable, 5))}
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                          c(.01,.2,.4,.6,.8), na.rm=T)
  }
}

nn_function <- function(measureFrom,measureTo,k) {
  
  nn <-   
    FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
  
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint)
  
  return(output)  
}

Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

2. Data Collection

As the developers, we downloaded the data from a great Kaggle dataset including a training and test csv and a station location dataset. we set up the right coordination and then load the Belgium map with stations locations. Some of the train stations are in Belgium and some of them are in other countries.

load train occupancy data and station locations

Data Cleaning

To better use the train dataset, we modify and delete some of the useless information. Since each station can be the arrival as well as the departure, we split the dataset and combine stations and trains twice to create trn_j based on OD pairs. We calculate the distance based on the coordinate system.

calculate distance between each stations

Below is the code to seperate the times based on hours, week and days.

categorize vehicle

Download other data and join it to the corresponding datasets

As one of the factors to influence the train operation, weather datasets were downloaded and modified based on months.The train data only covers the months from July to October.In addition, weather graphics are generated to sumamrize temperature, percipitation and wind speed for every hour between July and October.

Download and show the population density data

Besiders all these existing datasets, we consider train station facilities as one of the factors. Below are the steps to wrangle the facility datasets based on each station. We sum up all the facilities including vending machines, ramp, bike racks, wheelchair, elevators, etc.

3. Exploratory Analysis

In order to explore the relationship between times and train numbers or occupancy. We create several maps below to understand connection within these variables.

## Observations: 2,250
## Variables: 18
## $ date       <fct> 2016-08-02, 2016-10-07, 2016-10-18, 2016-10-15, 201...
## $ time       <fct> 10:44:44 AM, 05:31:27 PM, 08:14:07 PM, 08:11:53 PM,...
## $ connection <fct> 8814001/20160802/TGV9826, 008814001/20161007/THA936...
## $ from       <chr> "008814001", "008814001", "008814001", "008844628",...
## $ to         <chr> "008731901", "008727100", "008727100", "008891702",...
## $ vehicle    <fct> TGV9826, THA9360, THA9388, IC533, L2579, IC535, IC2...
## $ occupancy  <fct> medium, medium, high, low, low, medium, high, low, ...
## $ from.X     <dbl> 147730.27, 147730.27, 147730.27, 268014.50, 266545....
## $ from.Y     <dbl> 169477.16, 169477.16, 169477.16, 148494.98, 37574.3...
## $ to.X       <dbl> 227324.059, 2292.706, 2292.706, 49210.254, 153657.2...
## $ to.Y       <dbl> -652412.88, -45947.92, -45947.92, 214120.22, 211917...
## $ distance   <dbl> 825735.1, 259923.2, 259923.2, 228433.7, 207700.3, 1...
## $ datetime   <dttm> 2016-08-02 10:44:44, 2016-10-07 17:31:27, 2016-10-...
## $ interval60 <dttm> 2016-08-02 10:00:00, 2016-10-07 17:00:00, 2016-10-...
## $ interval15 <dttm> 2016-08-02 10:30:00, 2016-10-07 17:30:00, 2016-10-...
## $ week       <int> 31, 41, 42, 42, 32, 39, 41, 41, 41, 41, 39, 44, 34,...
## $ dotw       <fct> 3, 6, 3, 7, 4, 3, 5, 6, 1, 1, 6, 5, 5, 7, 7, 7, 5, ...
## $ veh        <chr> "TGV", "THA", "THA", "IC", "L", "IC", "IC", "IC", "...

Temporal Trend

A map shows the trains counts by day of the week based on occupancy. Morning and after work hours are the busiest periods in a day. All the trip numbers increase. Thursday and Friday also are signficant factors to affect the train shifts.

Similar to the prior map, this map below illustrates the numbers of occupancy levels based on hours. It is clear to see the peak time period during the day.

We also plotted weather trend in a similar fashion. It is interesting to see that the visibility relates to the number of trips greatly.

This plot shows that trips cluster around warm temperatures.

Spatial Trend

In our spatial process, we analysis the model by both space and time. We firstly split the dataset trn_j into from and to data.Vehicle numbers and types are vital variables to determine the capacible as well as the occupancy of the trains. Below, we extract 25 of the most frequent train numbers and map out on the belgium basemap. It is clear to see the train systems and how they operate from the animation simulation.

load rail data

In the next following steps, we would like to explore the current train occupancy average conditions on each station based on from and to situations. As we split the original dataset into two separate data, value 3, 2, 1 are given the occupancy rate:high,medium and low. Also, in order to easily process the visualization, days are categorized into 1-7.By calculating the average number of each station through day 1-7 in each week, we generate the following maps. It is interesting to see that some major cities such as Brussel, have high occupancy as destinations on weekend and early week days.

4. Feature Engineering

In the feature engineering process, we would like explore the relationship between each station. A couple of questions have been focused on:

  • How the nearby stations can affect the occupancy of certain station?
  • Will station density become a important variable in this prediction?
  • For each station, will the connection lines and capacity affect the occupancy?
  • Could it be possible that more closer to large cities such as Brussel, higher occupancy may occur?

As a side note, we realized that since the data does not have enough entries for all the hourly time interval for each spalial feature, we are not going to use space-time panel.

Spatial Process

relevel occupancy

The next step is to analysis the buffer around each station. We generate the total numbers of stations based on certain distance. In order to better use this variable in our model prediction, buffer amounts are categorized into several levels.

Through using knn function, we select the nearest 3 stations and calculate the average distance of these 3 surroundings. In order to meet the requirments in OC pairs conditions, we join the data again based from and to station in this knn function. By looking at the distribution for both origin and destination, we filter the stationdist within 20000 and 15000, for from and to respectively.We also categorize the stationdist to modify the result.

In the next following steps, we focus on the question that how the nearby stations could affect the station itself. To solve this problem, we again use the knn function to calculate the average distance of nearest 5 stations. A functino has been created to calculate the probability of high, medium and low occupancy based on these 5 surrounding stations. Eventually, we generate 3 occupancy probabilities for each station and add as variables for future modeling.

According to the Belgium demographic, we select 5 major stations as a factor to determine the distance. Brussels, Liege, Charleroi,Gent and Antwepen, as well as their coordinate are stored in a data frame. The average distance majc_nn5to these five cities are calculate for each station.

The below steps calcualte the distance from each station to each major city.

Can population density become a factor to affect the model? Maybe! our group import the Belgium regional and province population data and spatial join into station dataset based on ID.

We are using the population density of NUTS (Nomenclature of Territorial Units for Statistics) level 3 units from the European Union as the statistical region. Below is a map showing NUTS 3 boundaries.

What is the occupancy compostiion for each station, especially for high and low occupancy? Below is the code to calculate the high and low rates for each station based on from and to situation.

trn_j_h <- trn_j[occupancy=='high']
trn_j_l <- trn_j[occupancy=='low']

sta_from_total <- as.data.frame(table(trn_j$from)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_f=as.character(Var1)) %>%
  mutate(F_total=Freq) %>%
  select(3,4)

sta_from_h <- as.data.frame(table(trn_j_h$from)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_f=as.character(Var1)) %>%
  mutate(F_H_fre=Freq) %>%
  select(3,4)

sta_from_l <- as.data.frame(table(trn_j_l$from)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_f=as.character(Var1)) %>%
  mutate(F_L_fre=Freq) %>%
  select(3,4)  


sta_to_total <- as.data.frame(table(trn_j$to)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_to=as.character(Var1)) %>%
  mutate(To_total=Freq) %>%
  select(3,4)

sta_to_h <- as.data.frame(table(trn_j_h$to)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_to=as.character(Var1)) %>%
  mutate(To_H_fre=Freq) %>%
  select(3,4)

sta_to_l <- as.data.frame(table(trn_j_l$to)) %>% 
  filter(Var1!='(null)') %>% 
  mutate(station_to=as.character(Var1)) %>%
  mutate(To_L_fre=Freq) %>%
  select(3,4) 

  
stations_w <- merge(stations,sta_from_total,by.x="station_name",by.y='station_f',all.x=T)
stations_w <- merge(stations_w,sta_from_h,by.x="station_name",by.y='station_f',all.x=T)
stations_w <- merge(stations_w,sta_from_l,by.x="station_name",by.y='station_f',all.x=T)

stations_w <- merge(stations_w,sta_to_total,by.x="station_name",by.y='station_to',all.x=T)
stations_w <- merge(stations_w,sta_to_h,by.x="station_name",by.y='station_to',all.x=T)
stations_w <- merge(stations_w,sta_to_l,by.x="station_name",by.y='station_to',all.x=T)

stations_w$F_H_com <- stations_w$F_H_fre/stations_w$F_total
stations_w$F_L_com <- stations_w$F_L_fre/stations_w$F_total
stations_w$To_H_com <- stations_w$To_H_fre/stations_w$To_total
stations_w$To_L_com <- stations_w$To_L_fre/stations_w$To_total
stations_w <- select(stations_w,1,15,16,17,18)

Then we calculate the mean occupancy composition of near stations based on the train lines given back this link.

Additional feature engineering step is to see how many stations are connected with certain station. For example, we would like to know as a destination, how many lines are connected to Brussels, the same as origin. Below is the code to create this useful variable for each station.

Temporal Process

Next, some additional feature engineering is undertaken to mine the data for critical time trends. Like time lags is incredibly important when predicting time series trends. To capture these trends, several different time lags are created. arrange sorts the data by space then time and lag returns the occupancy for the previous time period, hypothesizing that occupanct in time t is in part, a function of Trip_Count in time t - n. The train data contains only August to October, which barely have holidays. In this case, we decide to only calculate the time lags instead of holiday effect in this model.

5. Regression and Validation

After finished all the process above, we decide to use the multinominal logistic regression to predict the train occupancy,because we have three independent variables: high, medium and low. It is ideal to use multinominal logistic regression since the typical logistic regression can only predict binary conditions. A couple of steps below to load some function for model running. The model will be validated on ROC and mean AUC based on sensitivity and specificity.

First, we define the prediction table or the confusion matrix by using two parameters, Hplus and Lplus, which represents the amount to add to the original predicted probability of each occupancy level. In this way, we are able to change the cut-off threshold for each model.

A function to calculate pseudo ROC is presented below.

Based on the ROC defined aboved, we define a pseudo AUC value for prediction metrics.

Another function is defined to display accuracy, specificity, and sensitivity.

We combine and merge all the useful variables together into trn_hjz dataset based on OD pairs. It will be our final regression model dataset.

##  [1] "X"                    "to"                   "interval60"          
##  [4] "from"                 "vehicle"              "occupancy"           
##  [7] "distance"             "interval15"           "week"                
## [10] "dotw"                 "veh"                  "F_H_com"             
## [13] "F_L_com"              "F_connect"            "To_H_com"            
## [16] "To_L_com"             "To_connect"           "country.code.x"      
## [19] "avg_stop_times.x"     "station.Buffer.x"     "station_buffer_cat.x"
## [22] "stationdist.x"        "stationdist_pct_from" "nb_occ_f_low"        
## [25] "nb_occ_f_high"        "majc_nn5.x"           "Brussels.x"          
## [28] "Liege.x"              "Charleroi.x"          "Gent.x"              
## [31] "Antwerpen.x"          "FID.x"                "Value.x"             
## [34] "facility_sum.x"       "country.code.y"       "avg_stop_times.y"    
## [37] "station.Buffer.y"     "station_buffer_cat.y" "stationdist.y"       
## [40] "stationdist_pct_to"   "nb_occ_t_low"         "nb_occ_t_high"       
## [43] "majc_nn5.y"           "Brussels.y"           "Liege.y"             
## [46] "Charleroi.y"          "Gent.y"               "Antwerpen.y"         
## [49] "FID.y"                "Value.y"              "facility_sum.y"      
## [52] "temperature.x"        "humidity.x"           "windspeed.x"         
## [55] "visibility.x"         "weather_type.x"       "key.x"               
## [58] "temperature.y"        "humidity.y"           "windspeed.y"         
## [61] "visibility.y"         "weather_type.y"       "key.y"               
## [64] "occ_num"              "time_of_day"          "lagHour"             
## [67] "lag2Hours"            "lag3Hours"            "lag4Hours"           
## [70] "lag12Hours"

First, we do a test run for a sinple model with few variables. We omit all the NA in our dataset and put all the varibles in the regression model.

## # weights:  1554 (1034 variable)
## initial  value 2434.524832 
## iter  10 value 2176.385487
## iter  20 value 1804.823846
## iter  30 value 1703.876172
## iter  40 value 1654.284196
## iter  50 value 1638.944026
## iter  60 value 1633.050790
## iter  70 value 1631.170943
## iter  80 value 1630.563907
## iter  90 value 1630.387794
## iter 100 value 1630.318719
## final  value 1630.318719 
## stopped after 100 iterations

Let us have a look at the resulting confusion matrix, which contains all three occupancy level. It is clear to see the True_Positive,True_Negative,False_Negative,False_Positive in this matrix.

##          f_occ
## mlt.class low medium high
##    low    665    149  169
##    medium 114    361  128
##    high   133    120  411

Below the function allows one to print out the accuracy, sensitivity and specificity for high, medium, and low occupancy. The mean_AUC function prints out the area under the pseudo ROC curve. Note that the result of mean AUC is greater than 1 because of the way we calculate it.

## [[1]]
## [1] 0.6386667
## 
## [[2]]
## [1] 0.5805085
## 
## [[3]]
## [1] 0.5730159
## 
## [[4]]
## [1] 0.7291667
## 
## [[5]]
## [1] 0.5805085
## 
## [[6]]
## [1] 0.5730159
## 
## [[7]]
## [1] 0.7291667
##   Total_auc.mean High_auc.mean Med_auc.mean Low_auc.mean
## 1      0.8951092     0.3331952    0.2792488    0.2826652

In general, the model is not so accurate. Sensitivity and specificity are not so large, with with high and medium occupancy sensitivity and specificity slightly higher than 0.5. However, this model predicts low occupancy more accurately.

Next we decide to use stepwise regression. Stepwise regression consists of iteratively adding and removing predictors, in the predictive model, in order to find the subset of variables in the data set resulting in the best performing model, that is a model that lowers prediction error. We use the resultant variables as a base to further fine tune which variables to use

After a few feature selection, we arrived at this model, which includes both the spatial and temporal features we created above.

## # weights:  4185 (2788 variable)
## initial  value 2425.735933 
## iter  10 value 2346.948047
## iter  20 value 2083.855191
## iter  30 value 1271.262773
## iter  40 value 857.407373
## iter  50 value 636.197519
## iter  60 value 563.975209
## iter  70 value 532.460071
## iter  80 value 522.485159
## iter  90 value 518.052252
## iter 100 value 515.527999
## final  value 515.527999 
## stopped after 100 iterations
##          f_occ
## mlt.class low medium high
##    low    816     32   29
##    medium  64    540   53
##    high    32     58  626
## [[1]]
## [1] 0.8808889
## 
## [[2]]
## [1] 0.8841808
## 
## [[3]]
## [1] 0.8571429
## 
## [[4]]
## [1] 0.8947368
## 
## [[5]]
## [1] 0.8841808
## 
## [[6]]
## [1] 0.8571429
## 
## [[7]]
## [1] 0.8947368
##   Total_auc.mean High_auc.mean Med_auc.mean Low_auc.mean
## 1       1.723582     0.5940068     0.582342    0.5472329

Let us plot the ROC curve. The y-axis represents the thresholds for low occupancy, the x-axis represents the thresholds for high occupancy, and the color represents the distance between ROC and the coin-flip line. Therefore, the darker the color, the better model.

Accuracy and Generalizability

First, we select the variables we are going to use.

After that, a function is defined to run the cross validation based on spatial features.

We will be using the NUTS level 3 statistical unit for the spatial cross validation.NUTS represents “Nomenclature of Territorial Units for Statistics”, which is a statistical boundary for European Union. Since our data might cover multiple European countries, it is the best to such unit for cross validation.

Based on the above three maps and the cross validation output, one can tell that our model is robust across space, with only little variation in the metrics.

But is our model predicts the three levels equally well?

The model is pretty even to all three occupancy levels.

6. Application

Revenue vs. Occupancy

As we mentioned in the introduction, this prediction model is intended to use for passengers to better save their time and provide convenient experience. In this case, more passengers would like to use this app to get information about occupancy and take the specific trains if our model is high, and vice versa. The company and government will in turn generate more profit with the increased number of passengers.

The code below is the function to calculate the total passenger who will continue use the APP based on our predction accuracy.

Customer Acquisition KPI

Based on our 3x3 matrix, the count is the prediction result from our model. We assume that the total passengers in this case will be equal to the total trains numbers in our model. The probability of taking is based on the senario that wheather the passengers are willing to take the train once they receive the prediction occupancy. The result is that wheather they are willing to continue use this app and take the train after getting off the train. The number is the total passengers after promoting this app since we assume the count will be the same as passengers.

We multiply the count x prob x result and plus the count x (1-result) to generate the total passengers numbers.

Now we can plot the customer acquisition, i.e. the number of passengers continue using our app. The graph below is colored by the number of retaining customers. Both axis shows the threshold for high and low occupancy cut off. The graph shows that higher “Lplus” (increase by about 0.5) will lead to more customer and thus more profit, assuming the price does not change. Recall that Lplus is defined as the amount to be added to the predicted probability of each occupancy level, in this case, higher Lplus means increasing the probability of predicting low occupancy.

Use Case Explanation

An powerful app has been created based on our algorithm. We combine our prediction occupancy result with train numbers, so passenger can decide which train they would like to take. In this app, we also inlcude the search, time, location, price and other useful information. People can simply use this app to access all the train information and enjoy their rides. Below are some wireframes concepts for our app. It is easy to follow and use.

We insert our youtube video link as an alternative explanation for our concept and app.

7. Conclusion

One of the challenges we faced is the NA value in calculating the occupancy compositions. In our feature engineering process, we are trying to analysis the occupancy level of prior and after stations. Since the from stations numbers, to stations numbers and total station numbers are different, some of the stations may not have information on specific OD. In this case, it may generate occupancy NA in from or to variable. Due to the lack of data on these specific conditions, the prediction model may not be accuracy by adding these variables. As you can see in our model validation process, some of these feature engineering variables are extracted from the regression.

In conclusion, this algorithm and app are used for providing more accuracy train occupancy information and gaining more profits for the company. We are optimistic that if the passengers can have more access to train information, they are more willing to take the train. However, the price is hard to determine since it is lack of datasets. Also, the datasets we used are based on the self report from all the prior passengers, so it is hard to decide the condition of occupancy since each person has own perspective.It would be reasonable if the Belgium or company can provide prior train occupancy information based on the acutal people amount on each train. Currently, we can only obtain the train data from 4 months. We believe the more datasets we can access, the more accurancy the model could be.